
Add hi-IN , Ko-KR and pt-BR IPA tokenizer support#15567

Open
quapham wants to merge 19 commits into NVIDIA-NeMo:main from quapham:hi_pt_BR_ipa

Conversation

Contributor

@quapham quapham commented Mar 31, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Extends the IpaG2p tokenizer to support the Hindi (hi-IN, with English code-switching), Korean (ko-KR), and Brazilian Portuguese (pt-BR) locales.

Collection: [Note which collection this PR will affect]
tts, common

Changelog

  • Add hi-IN, ko-KR and pt-BR to SUPPORTED_LOCALES in ipa_lexicon.py
  • Add INDIC_CHARS_ALL support to tokenizer_utils.py and IpaG2p (enables Indic script dict parsing)
  • Extend IpaG2p._parse_phoneme_dict() to accept a list of dicts, enabling multi-dict code-switching (e.g. Hindi + English)
  • Add pronunciation dict files: hi_prondict-v0.1.dict (hi-IN), ko_prondict-v1.0.dict (ko-KR) and pt_br_prondict-v1.0.dict (pt-BR)
  • Add unit tests: test_ipa_tokenizer_hi_in (hi-IN/en code-switching), test_ipa_ko_kr and test_ipa_tokenizer_pt_br

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
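
A minimal, self-contained sketch of what multi-dict code-switching means for lookup, written without NeMo so it runs anywhere. The entries are toy pronunciations for illustration only (not copied from the PR's .dict files), `lookup_sentence` is a hypothetical helper rather than part of IpaG2p, and the grapheme fallback for unknown words is an assumption about out-of-vocabulary handling.

```python
from typing import Dict, List

PronDict = Dict[str, List[List[str]]]

# Toy entries for illustration only; not taken from the PR's dictionary files.
HI_ENTRIES: PronDict = {"नमस्ते": [["n", "ə", "m", "ə", "s", "t", "eː"]]}
EN_ENTRIES: PronDict = {"world": [["w", "ɝ", "l", "d"]]}


def lookup_sentence(text: str, pron_dict: PronDict) -> List[List[str]]:
    """Return one symbol list per whitespace-separated word.

    Known words use their first listed pronunciation; unknown words fall
    back to their graphemes (an assumption, not confirmed NeMo behavior).
    """
    result = []
    for word in text.split():
        prons = pron_dict.get(word)
        result.append(list(prons[0]) if prons else list(word))
    return result


# The two toy dictionaries share no keys, so a plain dict union is enough here.
merged = {**HI_ENTRIES, **EN_ENTRIES}
phonemes = lookup_sentence("नमस्ते world NeMo", merged)
```

With the real class, the equivalent call would presumably pass the list of dictionaries as `phoneme_dict` together with `locale="hi-IN"`, as the Hindi unit test in this PR does.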

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [x] Make sure you read and followed Contributor guidelines
  • [x] Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@quapham quapham changed the title from "Add hi-IN and pt-BR IPA tokenizer support" to "Add hi-IN , Ko-KR and pt-BR IPA tokenizer support" Apr 16, 2026
@blisc blisc added the Run CICD label Apr 16, 2026
Collaborator

blisc commented Apr 16, 2026

Can you fix the linting and sign-off issues?

Contributor

Copilot AI left a comment


Pull request overview

Extends the NeMo TTS IPA G2P/tokenizer stack to better handle additional scripts/locales (Hindi with English code-switching, Korean, and Brazilian Portuguese) by expanding tokenization character coverage, dictionary parsing, and adding unit tests to validate expected IPA outputs.

Changes:

  • Added unit tests for IPA tokenization in pt-BR, hi-IN (Hindi/English code-switching), and ko-KR.
  • Expanded “any-locale” tokenization character coverage to include Indic and Korean Unicode ranges.
  • Updated IpaG2p dictionary parsing and regex handling to accept Indic/Korean words and merge multiple dictionaries.

Reviewed changes

Copilot reviewed 4 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py | Adds unit tests and small in-test pronunciation dictionaries for pt-BR, hi-IN code-switching, and ko-KR. |
| nemo/collections/tts/g2p/models/i18n_ipa.py | Extends IpaG2p regex + dictionary parsing to support Indic/Korean and multi-dict merging for code-switching. |
| nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py | Adds Indic and Korean Unicode ranges and expands any-locale word tokenization regex accordingly. |
| nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py | Adds pt-BR and ko-KR to supported locales and extends punctuation sets for hi-IN and ko-KR. |
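
To illustrate the "expanded character coverage" point, here is a rough sketch of a word-tokenization regex that treats Devanagari and Hangul syllables as word characters. The block boundaries are the standard Unicode ones (Devanagari U+0900 to U+097F, Hangul Syllables U+AC00 to U+D7A3); the constant names and the pattern itself are hypothetical, and the PR's INDIC_CHARS_ALL may well cover more Indic blocks than this.

```python
import re

# Standard Unicode block boundaries; the names are illustrative, not from tokenizer_utils.py.
DEVANAGARI = "\u0900-\u097F"        # includes vowel signs and the virama
HANGUL_SYLLABLES = "\uAC00-\uD7A3"  # precomposed Hangul syllables

# A word is a run of Latin letters, apostrophes, Devanagari or Hangul characters.
WORD_PATTERN = re.compile(rf"[A-Za-z'{DEVANAGARI}{HANGUL_SYLLABLES}]+")

text = "नमस्ते world, 안녕하세요!"
tokens = WORD_PATTERN.findall(text)
```

Listing the whole Devanagari block explicitly keeps combining marks such as the virama inside the same token as the consonants they attach to.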


Comment thread nemo/collections/tts/g2p/models/i18n_ipa.py
Comment thread nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py Outdated
Comment on lines 47 to 51
  def __init__(
      self,
-     phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
+     # phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
+     phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]],
      locale: str = "en-US",

Copilot AI Apr 20, 2026


The phoneme_dict type annotation doesn't match the supported runtime behavior: the Hindi unit test passes a list of dicts for code-switching, but the annotation only allows List[Union[str, Path]]. This will trip static type checking and makes the API contract unclear; broaden the union to allow lists containing dicts (or use a Sequence[...]) and update the parameter docstring accordingly (also remove the stale commented-out type line).

Collaborator

@XuesongYang XuesongYang Apr 20, 2026


I agree with Copilot's comment: the stale commented-out type line needs to be removed and the typing fixed. This appears in three places: __init__, _parse_phoneme_dict, and replace_dict.

The type List[Union[str, pathlib.Path]] doesn't reflect the actual runtime behavior. The Hindi test passes [self.PHONEME_DICT_HI, self.PHONEME_DICT_EN], which is a list of dicts. The recursive call in _parse_phoneme_dict handles this correctly at runtime, but the type annotation is misleading.

Suggested change
def __init__(
self,
phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
# phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]],
locale: str = "en-US",
def __init__(
self,
phoneme_dict: Union[
str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
],
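
Since the same union is repeated in three signatures, one option would be a shared type alias. The sketch below is only an illustration of that idea: the alias names (`PronDict`, `PhonemeDictSource`, `PhonemeDictArg`) and the `describe` helper are hypothetical and do not appear in the PR, which spells the union out inline.

```python
import pathlib
from typing import Dict, List, Union

# Hypothetical aliases; together they are equivalent to the suggested inline union.
PronDict = Dict[str, List[List[str]]]
PhonemeDictSource = Union[str, pathlib.Path, PronDict]
PhonemeDictArg = Union[PhonemeDictSource, List[PhonemeDictSource]]


def describe(phoneme_dict: PhonemeDictArg) -> str:
    """Report which branch of the union a value falls into."""
    if isinstance(phoneme_dict, list):
        return "list of " + ", ".join(describe(src) for src in phoneme_dict)
    if isinstance(phoneme_dict, dict):
        return "dict"
    if isinstance(phoneme_dict, (str, pathlib.Path)):
        return "path"
    raise TypeError(f"Unsupported phoneme_dict type: {type(phoneme_dict).__name__}")


mixed = describe([{"hello": [["h", "ə", "ˈ", "l", "o", "ʊ"]]}, "toy_en.dict"])
```

An alias like this would keep `__init__`, `_parse_phoneme_dict`, and `replace_dict` in sync if the accepted types change again.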

quapham and others added 8 commits April 20, 2026 12:04
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quapham <quapham@users.noreply.github.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quapham <quapham@users.noreply.github.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
  @staticmethod
  def _parse_phoneme_dict(
-     phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]]
+     phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
Collaborator

Suggested change
phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
phoneme_dict: Union[
str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
],


- def replace_dict(self, phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]]):
+ def replace_dict(
+     self, phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
Collaborator

Suggested change
self, phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
self,
phoneme_dict: Union[
str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
],

  ) -> Dict[str, List[List[str]]]:
      """
-     parse an input IPA dictionary and save it as a dict object.
+     parse an input IPA dictionary (or multiple) and save it as a dict object.
Collaborator

Suggested change
parse an input IPA dictionary (or multiple) and save it as a dict object.
Parse one or more IPA dictionaries and return a merged dict object.

Comment on lines 169 to 173
@@ -167,6 +174,14 @@ def _parse_phoneme_dict(
Collaborator

Suggested change
Args:
phoneme_dict: A single phoneme dictionary source or a list of sources for multi-dictionary
code-switching (e.g. Hindi + English). Each source can be:
- a file path (str or pathlib.Path) in CMUdict format,
e.g. ``scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.06.txt``
- a dict object with CMUdict-like entries,
e.g. ``{"Wire": [["ˈ", "w", "a", "ɪ", "ɚ"], ["ˈ", "w", "a", "ɪ", "ɹ"]]}``
When a list is provided, all sources are parsed and merged into a single dictionary.
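
The behavior this docstring describes (parse every source, then merge) could be sketched roughly as below. This is not the PR's implementation: `parse_phoneme_dict` is a hypothetical standalone function, the file format is simplified to one `WORD<whitespace>PRONUNCIATION` entry per line with `;;;` comment lines, each character of the pronunciation is treated as one symbol, and the merge policy (append pronunciations not already listed for a word) is an assumption. Case normalization is not modeled.

```python
import pathlib
import tempfile
from collections import defaultdict
from typing import Dict, List, Union

PronDict = Dict[str, List[List[str]]]
Source = Union[str, pathlib.Path, PronDict]


def parse_phoneme_dict(source: Union[Source, List[Source]]) -> PronDict:
    """Parse one source, or recursively parse and merge a list of sources."""
    if isinstance(source, list):
        merged = defaultdict(list)
        for item in source:
            for word, prons in parse_phoneme_dict(item).items():
                for pron in prons:
                    # Assumed policy: keep every distinct pronunciation per word.
                    if pron not in merged[word]:
                        merged[word].append(pron)
        return dict(merged)
    if isinstance(source, dict):
        # Copy so the caller's dict is never mutated.
        return {word: [list(p) for p in prons] for word, prons in source.items()}
    # Otherwise treat the source as a path to a simplified dictionary file.
    parsed = defaultdict(list)
    with open(source, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):
                continue
            word, pron = line.split(maxsplit=1)
            parsed[word].append(list(pron))
    return dict(parsed)


# Toy data for illustration only.
hi_entries = {"नमस्ते": [["n", "ə", "m", "ə", "s", "t", "eː"]]}
with tempfile.TemporaryDirectory() as tmp:
    en_path = pathlib.Path(tmp) / "toy_en.dict"
    en_path.write_text(";;; toy English entries\nhello  həˈloʊ\n", encoding="utf-8")
    merged = parse_phoneme_dict([hi_entries, en_path])
```

Handling the list case first and recursing keeps the single-source branches unchanged, which matches the reviewer's description of the recursive call in _parse_phoneme_dict.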

Comment thread nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py Outdated
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
@XuesongYang XuesongYang requested a review from blisc April 21, 2026 00:40
Collaborator

@XuesongYang XuesongYang left a comment


@quapham I made some suggestions on the PR. Please apply them if you agree they are correct. Thanks!

FYI, I pushed changes to the unit tests directly so that they cover more cases.

…kenizers.py

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: quapham <quapham@users.noreply.github.com>

5 participants